Frame

The client bank XYZ is running a direct marketing campaign. It wants to identify customers who would potentially be buying their new term deposit plan.

Acquire

Data is obtained from UCI Machine Learning repository. http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing

Data from direct marketing campaign (phone calls) of a Portuguese Bank is provided.

Attribute Information:

bank client data:

  1. age (numeric)
  2. job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
  3. marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
  4. education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
  5. default: has credit in default? (categorical: 'no','yes','unknown')
  6. housing: has housing loan? (categorical: 'no','yes','unknown')
  7. loan: has personal loan? (categorical: 'no','yes','unknown')
  1. contact: contact communication type (categorical: 'cellular','telephone')
  2. month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
  3. day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
  4. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

other attributes:

  1. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  2. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  3. previous: number of contacts performed before this campaign and for this client (numeric)
  4. poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

social and economic context attributes

  1. emp.var.rate: employment variation rate - quarterly indicator (numeric)
  2. cons.price.idx: consumer price index - monthly indicator (numeric)
  3. cons.conf.idx: consumer confidence index - monthly indicator (numeric)
  4. euribor3m: euribor 3 month rate - daily indicator (numeric)
  5. nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):

y - has the client subscribed a term deposit? (binary: 'yes','no')

The given data is randomly divided into train and test for the purpose of this workshop. Build the model for train and use it to predict on test.

Explore


In [1]:
#Import the necessary libraries
import numpy as np
import pandas as pd

In [2]:
#Read the train and test data
train = pd.read_csv("../data/train.csv")
test = pd.read_csv("../data/test.csv")

Exercise 1

print the number of rows and columns of train and test


In [16]:



(35211, 17) (10000, 17)

Exercise 2

Print the first 10 rows of train


In [4]:



Out[4]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome deposit
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no

Exercise 3

Print the column types of train and test. Are they the same in both train and test?


In [5]:
#train


Out[5]:
age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
deposit      object
dtype: object

In [6]:
#test


Out[6]:
age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
deposit      object
dtype: object

In [7]:
#Are they the same?

In [64]:
#Combine train and test
frames = [train, test]
input = pd.concat(frames)

In [9]:
#Print first 10 records of input


Out[9]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome deposit
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no

Exercise 4

Find if any column has missing value There is a pd.isnull function. How to use that?


In [12]:



Out[12]:
age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
deposit      0
dtype: int64

In [65]:
#Replace deposit with a numeric column
#First, set all labels to be 0
input.at[:, "depositLabel"] = 0
#Now, set depositLabel to 1 whenever deposit is yes
input.at[input.deposit=="yes", "depositLabel"] = 1

In [ ]:

Exercise 5

Find % of customers in the input dataset who have purchased the term deposit


In [72]:



Out[72]:
11.698480458295547

In [75]:
#Create the labels 
labels = 
labels


Out[75]:
0       0
1       0
2       0
3       0
4       0
5       0
6       0
7       0
8       0
9       0
10      0
11      0
12      0
13      0
14      0
15      0
16      0
17      0
18      0
19      0
20      0
21      0
22      0
23      0
24      0
25      0
26      0
27      0
28      0
29      0
       ..
9970    1
9971    1
9972    1
9973    1
9974    1
9975    1
9976    1
9977    1
9978    1
9979    1
9980    1
9981    1
9982    1
9983    1
9984    1
9985    1
9986    1
9987    1
9988    1
9989    1
9990    1
9991    1
9992    1
9993    1
9994    1
9995    1
9996    1
9997    1
9998    1
9999    1
Name: depositLabel, dtype: int64

In [83]:
#Drop the deposit column 
input.drop(["deposit", "depositLabel"], axis=1)

Exercise 6

Did it drop? If not, what has to be done?

Exercise 7

Print columnn names of input


In [ ]:


In [85]:
#Get list of columns that are continuous/integer
continuous_variables = input.dtypes[input.dtypes != "object"].index

In [86]:
continuous_variables


Out[86]:
Index([u'age', u'balance', u'day', u'duration', u'campaign', u'pdays',
       u'previous'],
      dtype='object')

In [87]:
#Get list of columns that are categorical
categorical_variables = input.dtypes[input.dtypes=="object"].index

In [88]:
categorical_variables


Out[88]:
Index([u'job', u'marital', u'education', u'default', u'housing', u'loan',
       u'contact', u'month', u'poutcome'],
      dtype='object')

Exercise 8

Create inputInteger and inputCategorical - two datasets - one having integer variables and another having categorical variables


In [89]:
inputInteger =

In [91]:
#print inputInteger
inputInteger.head()


Out[91]:
age balance day duration campaign pdays previous
0 58 2143 5 261 1 -1 0
1 44 29 5 151 1 -1 0
2 33 2 5 76 1 -1 0
3 47 1506 5 92 1 -1 0
4 33 1 5 198 1 -1 0

In [93]:
inputCategorical =

In [94]:
#print inputCategorical
inputCategorical.head()


Out[94]:
job marital education default housing loan contact month poutcome
0 management married tertiary no yes no unknown may unknown
1 technician single secondary no yes no unknown may unknown
2 entrepreneur married secondary no yes yes unknown may unknown
3 blue-collar married unknown no yes no unknown may unknown
4 unknown single unknown no no no unknown may unknown

In [101]:
#Convert categorical variables into Labels using labelEncoder

inputCategorical = np.array(inputCategorical)

Exercise 9

Find length of categorical_variables


In [102]:



Out[102]:
9

In [119]:
#Load the preprocessing module
from sklearn import preprocessing

In [103]:
for i in range(len(categorical_variables)):
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(inputCategorical[:,i]))
    inputCategorical[:, i] = lbl.transform(inputCategorical[:, i])

In [105]:
#print inputCategorical

Exercise 10

Convert inputInteger to numpy array


In [107]:
inputInteger = 
inputInteger


Out[107]:
array([[  58, 2143,    5, ...,    1,   -1,    0],
       [  44,   29,    5, ...,    1,   -1,    0],
       [  33,    2,    5, ...,    1,   -1,    0],
       ..., 
       [  69,  247,   22, ...,    2,   -1,    0],
       [  48,    0,   28, ...,    2,   -1,    0],
       [  31,  131,   15, ...,    1,   -1,    0]])

Exercise 11

Now, create the inputUpdated array that has both inputInteger and inputCategorical concatenated

Hint Check function called vstack and hstack


In [ ]:


In [118]:
inputUpdated.shape


Out[118]:
(45211, 16)

Train the model

Model 1: Decision Tree


In [125]:
from sklearn import tree
from sklearn.externals.six import StringIO
import pydot

In [126]:
bankModelDT = tree.DecisionTreeClassifier(max_depth=2)

In [127]:
bankModelDT.fit(inputUpdated[:train.shape[0],:], labels[:train.shape[0]])


Out[127]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [128]:
dot_data = StringIO() 
tree.export_graphviz(bankModelDT, out_file=dot_data) 
graph = pydot.graph_from_dot_data(dot_data.getvalue()) 
graph.write_pdf("bankDT.pdf")


Out[128]:
True

In [129]:
#Check the pdf

Exercise 12

Now, change the max_depth = 6 and check the results.

Then, change the max_depth= None and check the results


In [ ]:


In [144]:
# Prediction
prediction_DT = bankModelDT.predict(inputUpdated[train.shape[0]:,:])

In [133]:
#Compute the error metrics

In [134]:
import sklearn.metrics

In [135]:
sklearn.metrics.auc(labels[train.shape[0]:], prediction_DT)


Out[135]:
0.5

In [136]:
#What does that tell?

In [137]:
#What's the error AUC for the other Decision Tree Models

Exercise 13

Instead of predicting classes directly, predict the probability and check the auc


In [ ]:


In [142]:
sklearn.metrics.auc(labels[train.shape[0]:], prediction_DT[:,0])


Out[142]:
0.54849867669154428

Accuracy Metrics

  • AUC
  • ROC
  • Misclassification Rate
  • Confusion Matrix
  • Precision & Recall

Confusion Matrix

Calculate True Positive Rate

TPR = TP / (TP+FN)

Calculate False Positive Rate

FPR = FP / (FP+TN)

Precision

Recall


In [147]:
#Precision and Recall

In [145]:
sklearn.metrics.precision_score(labels[train.shape[0]:], prediction_DT)


Out[145]:
0.57177033492822971

In [146]:
sklearn.metrics.recall_score(labels[train.shape[0]:], prediction_DT)


Out[146]:
0.20427350427350427

Random Forest


In [148]:
from sklearn.ensemble import RandomForestClassifier

In [157]:
bankModelRF = RandomForestClassifier(n_jobs=-1, oob_score=True)

In [158]:
bankModelRF.fit(inputUpdated[:train.shape[0],:], labels[:train.shape[0]])


Out[158]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [156]:
bankModelRF.oob_score_


Out[156]:
0.89128397375820057

Exercise 14

Do the following

  1. Predict on test
  2. Find accuracy metrics: AUC, Precision, Recall
  3. How does it compare against Decision Tree

In [ ]:

Gradient Boosting Machines


In [160]:
import xgboost as xgb

In [176]:
params = {}
params["min_child_weight"] = 3
params["subsample"] = 0.7
params["colsample_bytree"] = 0.7
params["scale_pos_weight"] = 1
params["silent"] = 0
params["max_depth"] = 4
params["nthread"] = 6
params["gamma"] = 1
params["objective"] = "binary:logistic"
params["eta"] = 0.005
params["base_score"] = 0.1
params["eval_metric"] = "auc"
params["seed"] = 123

In [177]:
plst = list(params.items())
num_rounds = 120

In [178]:
xgtrain_pv = xgb.DMatrix(inputUpdated[:train.shape[0],:], label=labels[:train.shape[0]])
watchlist = [(xgtrain_pv, 'train')]
bankModelXGB = xgb.train(plst, xgtrain_pv, num_rounds)

In [179]:
prediction_XGB = bankModelXGB.predict(xgb.DMatrix(inputUpdated[train.shape[0]:,:]))

In [180]:
sklearn.metrics.auc(labels[train.shape[0]:], prediction_XGB)


Out[180]:
0.19817152619361877

Another way of encoding

One Hot Encoding

Whiteboard !


In [175]:
inputOneHot = pd.get_dummies(input)

Exercise 15

On the one hot encoded data, train

  1. Decision Tree
  2. Random Forest
  3. xgboost

Which one works best on the test dataset?


In [ ]: